(57) ABSTRACT

An apparatus and method for determining if a query docu- 
ment matches one or more of a plurality of documents in a 
database. In a coarse matching stage, a compressed file or 
other query document is scanned to produce a bit profile. 
Global statistics such as line spacing and text height are 
calculated from the bit profile and used to narrow the field 
of documents to be searched in an image database. The bit 
profile is cross-correlated with bit profiles of documents in 
the search space to identify candidates for a detailed match- 
ing stage. If multiple candidates are generated in the coarse 
matching stage, a set of endpoint features is extracted from 
the query document for detailed matching in the detailed 
matching stage. Endpoint features contain sufficient infor- 
mation for various levels of processing, including page skew 
and orientation estimation. In addition, endpoint features are 
stable, symmetric and easily computable from commonly 
used compressed files including, but not limited to, CCITT
Group 4 compressed files. Endpoint features extracted in the 
detailed matching stage are used to correctly identify a 
matching document in a high percentage of cases. 

17 Claims, 9 Drawing Sheets 
COMPRESSED DOCUMENT MATCHING

FIELD OF THE INVENTION 

The present invention relates to the field of document 
management, and more particularly to detecting duplicate 
documents. 

BACKGROUND OF THE INVENTION 

With the increased ease of creating and transmitting 
electronic document images, it has become common for 
document images to be maintained in database systems that 
include automated document insertion and retrieval utilities. 
Consequently, it has become increasingly important to be 
able to efficiently and reliably determine whether a duplicate 
of a document submitted for insertion is already present in 
a database. Otherwise, duplicate documents will be stored in 
the database, needlessly consuming precious storage space. 
Determining whether a database contains a duplicate of a 
document is referred to as document matching. 

In currently available image-content based retrieval 
systems, color, texture and shape features are frequently 
used for document matching. Matching document images 
that are mostly bitonal and similar in shape and texture poses 
different problems. 

A common document matching technique is to perform 
optical character recognition (OCR) followed by a text 
based search. Another approach is to analyze the layout of 
the document and look for structurally similar documents in 
the database. Unfortunately, both of these approaches 
require computationally intensive page analysis. One way to
reduce the computational cost is to embed specially
designed markers in the documents so that the documents can
be reliably identified.

Recently, alternatives to the text based approach have 
been developed by extracting features directly from images, 
with the goal of achieving efficiency and robustness over 
OCR. An example of such a feature is word length. Using 
sequences of word lengths in documents as indexes, match- 
ing documents may be identified by comparing the number 
of hits in each of the images generated by the query. Another 
approach is to map alphabetic characters to a small set of 
character shape codes (CSCs) which can be used to compile 
search keys for ASCII text retrieval. CSCs can also be 
obtained from text images based on the relative positions of 
connected components to baselines and x-height lines. In 
this way CSCs can be used for word spotting in document 
images. The application of CSCs has been extended to 
document duplicate detection by constructing multiple 
indexes using short sequences of CSCs extracted from the 
first line of text of sufficient length. 

A significant disadvantage of the above-described 
approaches is that they are inherently text line based. Line, 
word or even character segmentation must usually be performed.
In one non-text-based approach, duplicate detection
is based on horizontal projection profiles. The distance 
between wavelet coefficient vectors of the profiles represents 
document similarity. This technique may out-perform the 
text-based approach on degraded documents and documents 
with small amounts of text. 

Because the majority of document images in databases are 
stored in compressed formats, it is advantageous to perform 
document matching on compressed files. This eliminates the 
need for decompression and recompression and makes com- 
mercialization more feasible by reducing the amount of 
memory required. Of course, matching compressed files
presents additional challenges. For CCITT Group 4 compressed
files, pass codes have been shown to contain information
useful for identifying similar documents. In one
prior-art technique, pass codes are extracted from a small
text region and used with the Hausdorff distance metric to
correctly identify a high percentage of duplicate documents. 
However, calculation of the Hausdorff distance is computa- 
tionally intensive and the number of distance calculations 
scales linearly with the size of the database.


SUMMARY OF THE INVENTION 

A method and apparatus for determining if a query 
document matches one or more of a plurality of documents 
in a database are disclosed. A bit profile of the query
document is generated based on the number of bits required
to encode each of a plurality of rows of pixels in the 
document. The bit profile is compared against bit profiles 
associated with the plurality of documents in the database to 
identify one or more candidate documents. Endpoint features
are identified in the query document and a set of
descriptors for the query document are generated based on 
locations of the endpoint features. The set of descriptors 
generated for the query document are compared against 
respective sets of descriptors for the one or more candidate
documents to determine if the query document matches at
least one of the one or more candidate documents. 

Other features and advantages of the invention will be 
apparent from the accompanying drawings and from the
detailed description that follows below.

DESCRIPTION OF THE DRAWINGS 

The present invention is illustrated by way of example 
and not limitation in the figures of the accompanying 
drawings in which like references indicate similar elements
and in which: 

FIG. 1 illustrates an overview of two-stage document 
matching according to one embodiment; 

FIG. 2 illustrates the coarse matching stage according to 
one embodiment;

FIG. 3 illustrates a query document and its corresponding 
bit profile, bandpass filtered profile, phase group delay graph 
and power spectrum density; 
FIG. 4 illustrates commonly seen deformations between
two matching document images and their corresponding 
power spectrum densities; 

FIG. 5A illustrates an example of reference points used in 
CCITT Group 4 encoding for pass mode encoding; 
FIG. 5B illustrates an example of reference points used in
CCITT Group 4 encoding for horizontal mode encoding; 

FIG. 6 illustrates differences between pass codes and 
endpoints; 

FIG. 7A illustrates down endpoint extraction according to 
one embodiment;

FIG. 7B illustrates up endpoint extraction according to 
one embodiment; 
FIG. 8 illustrates a set of endpoints after skew correction 
and a corresponding horizontal projection, local maxima of
projection and matching local maxima; 

FIG. 9 illustrates an exemplary set of endpoints located 
within a pair of text line segments; 

FIG. 10 illustrates quantization of distances between 
consecutive endpoint markers;

FIG. 11 is a table that summarizes coarse matching recall 
rates for different values of N; 
FIG. 12 illustrates examples of correctly and incorrectly
matched images;

FIG. 13 is a table that summarizes results of detailed
matching in a database using different numbers of consecutive
distances per descriptor; and

FIG. 14 is a block diagram of a processing system that can
be used to perform processing operations used in embodiments
of the present invention.

DETAILED DESCRIPTION

A two-stage approach to detecting duplicates of compressed
documents is disclosed herein in various embodiments.
Although the embodiments are described primarily in
terms of CCITT Group 4 compressed documents, the invention
is not so limited and may be applied to other types of
documents, including, but not limited to, CCITT Group 3
compressed documents and TIFF formatted files.

Terminology

The following terms, phrases and acronyms appear
throughout this specification. Unless a different meaning is
clear from context, the following definitions apply:

Recall Rate: percentage of correct matches in a database that
are returned.
MMR: Modified Modified Relative Element Address Designate.
Text Concentration: lines of text per unit area (e.g., 5 lines
per inch).
CCITT: Consultative Committee for International Telegraph
and Telephone.
TIFF: Tagged Image File Format.
G3 or Group 3 Compression: document image compression
technique described in CCITT Specification T.4.
G4 or Group 4 Compression: document image compression
technique described in CCITT Specification T.6.
Ground Truth Information: known correct information.
Wavelet: statistical feature that describes the shape of an
image.
Document Image: digital image of a sheet of paper or similar
medium.
Scanline: a row of pixels in a document image.
Halftone: simulation of gray-scale image using resolute
black and white dots.
Down Sample: a technique for reducing resolution by averaging
or otherwise combining multiple pixels into a single
pixel.
Huffman Codes: bit codes for encoding runs of pixels.

FIG. 1 illustrates an overview of two-stage document
matching according to one embodiment. In a coarse matching
stage 15, a compressed file 12 or other query document
is scanned to produce a bit profile. Global statistics such as
line spacing and text height are calculated from the bit
profile and used to narrow the field of documents to be
searched in an image database 14. The bit profile is then
cross-correlated with precomputed bit profiles of documents
in the search space to identify candidates 17 for a detailed
matching stage 20. If multiple candidates 17 are generated
in the coarse matching stage 15, a set of endpoint features is
extracted from the query document for detailed matching in
the detailed matching stage 20. Endpoint features contain
sufficient information for various levels of processing,
including page skew and orientation estimation. In addition,
endpoint features are stable, symmetric and easily computable
from commonly used compressed files including, but
not limited to, Group 4 compressed files. Endpoint features
extracted in the detailed matching stage 20 are used to
correctly identify a matching document 21 in a high percentage
of cases.

1. Coarse Matching Stage

The primary goal of the coarse matching stage 15 is to
produce a set of candidates with a high recall rate. Therefore,
the features used must be easy to compute and robust to
common imaging distortions. The most obvious feature
available without decompression is the compressed file size.
Unfortunately, compressed file sizes can vary significantly
between matching documents due to inconsistent halftone
qualities in the original and photocopied images, or the
halftoning effects near edges of documents. Almost as
accessible but much more informative is the compressed
size of each scanline.

FIG. 2 illustrates the coarse matching stage according to
one embodiment. Initially, a one pass scan through a G4 or
otherwise compressed query image 12 produces a compression
bit profile 25. Spectral analysis techniques are then
applied to the bit profile 25 to generate robust global
statistics. The global statistics of the query image 12 are
compared to precomputed global statistics 27 of document
images in the database to generate a set of initial candidates.
The precomputed bit profiles 29 of the initial candidates are
cross correlated against the bit profile 25 of the query image
12 to produce a hypothesis that includes a set of ranked
candidates 17. Further processing may be avoided if a highly
confident match is found by cross correlation.

1.1 Bit Profile Extraction

The Group 4 compression standard defines a two-dimensional,
run-length based coding scheme (MMR) in
which each scanline is encoded relative to the line above.
Depending on the patterns of consecutive runs on these two
lines, the appropriate Huffman codes are generated. Because
MMR coding is deterministic, the same image pattern will
produce a similar compression ratio regardless of its location
in a document. Consequently, a useful feature to compute is
the number of bits required to encode each row of pixels. In
general, halftones require the most bits for encoding; texts
require fewer bits, and background even fewer. For images
which are text-dominant and oriented horizontally, the bit
profiles should show peaks and valleys corresponding to text
lines. An example of the peaks and valleys that result from
text lines is shown by region 41 of the bit profile in FIG. 3.
The bit profile 25 has been generated from a compressed
version of the document image 40.

In contrast to the horizontal projection of ink density (e.g.,
average number of black pixels in each line), the bit profile
shows where the information actually is. For example, a
large black region often encountered at the edge of photocopied
documents (e.g., region 43 in FIG. 3) will have little effect
on the bit profile 25, whereas a large peak will be produced
in an ink density profile. In fact, the bit profile 25 will not
look much different if the page is in reverse video.
Moreover, the bit profile 25 conveys more structural information
about the distribution of inks on a scanline than does
an ink density profile. For a set of point sizes commonly
occurring in documents, the compression ratio (in normalized
units) for full page-width text lines is quite consistent,
making them distinguishable from halftones, whereas text
and halftones can have similar ink densities.

1.2 Hypothesis Generation

In many cases, the bit profile carries too little information
to uniquely identify a single document. However, duplicate
documents will usually have similar profiles. Direct comparison
of bit profiles based on distance calculation can fail
due to even small vertical translations of the bit profiles
relative to one another. Therefore, in at least one
embodiment, cross correlation is used. Cross correlation of
profile vectors can be efficiently computed as products of
their Fourier transforms. Cross correlation also produces a
vertical registration which may be useful for identifying 
corresponding sections in a pair of images for local feature 
extraction. To further reduce the computational cost, global 
document statistics are calculated and used to confine the
search space. 
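
As an illustration of the cross correlation step, the following minimal Python sketch computes the cross correlation of two bit profiles as a product of their Fourier transforms and reads the vertical registration off the correlation peak. The function names (fft, cross_correlate, best_shift), the zero padding to a power of two and the toy profiles are assumptions made for this example; they are not taken from the described embodiments.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlate(query, ref):
    """Return c where c[k] = sum over m of query[m] * ref[m - k] (indices mod n)."""
    n = 1
    while n < len(query) + len(ref):  # zero padding avoids wrap-around overlap
        n *= 2
    fq = fft([complex(v) for v in query] + [0j] * (n - len(query)))
    fr = fft([complex(v) for v in ref] + [0j] * (n - len(ref)))
    product = [a * b.conjugate() for a, b in zip(fq, fr)]
    return [v.real / n for v in fft(product, inverse=True)]


def best_shift(query, ref):
    """Vertical offset (in scanlines) of query relative to ref, and the peak value."""
    corr = cross_correlate(query, ref)
    n = len(corr)
    k = max(range(n), key=lambda i: corr[i])
    shift = k if k < len(query) else k - n  # high indices are negative lags
    return shift, corr[k]


# Toy bit profiles: the query is the reference moved down by two scanlines.
ref_profile = [0, 0, 5, 1, 5, 0, 0, 0]
query_profile = [0, 0, 0, 0, 5, 1, 5, 0]
shift, peak = best_shift(query_profile, ref_profile)
```

In this sketch a positive shift means the query content sits lower on the page than the reference content. A practical system would likely normalize the profiles before comparing peak values across candidates; that step is omitted here.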

Several global statistics can be extracted from bit profiles. 
The periodic nature of bit profiles suggests that spectral 
properties will be more useful than statistical moments. The 
dominant line spacing, the number of text lines and the
location of the text provide a good first level characterization 
of a document, and these statistics can be readily extracted 
from bit profiles in the spectral domain. In one embodiment, 
the Power Spectrum Density (PSD) is used to analyze the 
frequency constituents of a bit profile. FIG. 3 shows a PSD
45 that has been calculated from the bit profile 25. The 
dominant line spacing of the query image 40 (or compressed 
version thereof) can be directly calculated from the highest 
peak 47 in the PSD 45. Although spectral analysis does not
provide a quantitative measure of the number of text lines in
the query image 40, the energy under peak frequency 
(shown by arrow 49) in the PSD 45 is a good indication of 
the amount of text on the page. In one embodiment, the 
location of the text lines in the query image is estimated by 
applying a bandpass filter, centered at the dominant line
spacing frequency, to the bit profile 25. The filtered signal 
will have large amplitude at text locations, as shown by text 
energy profile 51. Sections of the bit profile which are linear 
in phase correspond well to text blocks, as shown by the 
constant, low valued regions 55 in the phase group delay
graph 53 (plotted in radians). In one embodiment, a centroid 
59 of the text energy profile 51 and the width of 90% energy 
span 60 are used as an estimation for text location and 
concentration. In one embodiment, the text location and 
concentration, along with peak frequency and total text
energy, are used as global statistics to define a search 
window in the space of database images. Other global 
statistics or different combinations of global statistics may 
be used in alternate embodiments. 
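
The way the dominant line spacing falls out of the power spectrum can be sketched as follows. This is a minimal illustration only: the naive discrete Fourier transform, the function names (power_spectrum, dominant_line_spacing) and the synthetic cosine profile are assumptions chosen for the example, not the estimator used in any particular embodiment.

```python
import cmath
import math


def power_spectrum(profile):
    """Power in frequency bins 0..n/2 of the mean-removed profile (naive DFT)."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * m / n)
                    for m, v in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def dominant_line_spacing(profile):
    """Return (line spacing in scanlines, power at the peak frequency)."""
    spectrum = power_spectrum(profile)
    # Skip bin 0 (the DC term) and take the strongest remaining bin.
    peak = max(range(1, len(spectrum)), key=lambda k: spectrum[k])
    return len(profile) / peak, spectrum[peak]


# Synthetic bit profile: one "text line" every 16 scanlines over 128 scanlines.
profile = [100 + 50 * math.cos(2 * math.pi * row / 16) for row in range(128)]
spacing, peak_energy = dominant_line_spacing(profile)
```

For real profiles the peak is rarely confined to a single bin, so the energy around the peak, rather than the single bin value returned here, would be the more natural indicator of the amount of text.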

1.3 Feature Analysis

At this point, it is worth discussing the robustness of the 
bit profile feature and global statistics with respect to various 
deformations. FIG. 4 illustrates some commonly seen defor- 
mations between two matching document images. A useful 
observation in analyzing these problems is the following: if
the two pages are not skewed relative to each other, then the 
bit profile of the noisy image contains the bit profile of the 
clean image superimposed, by addition, with the bit profile 
of everything else on the page. As mentioned, large, uniformly
black regions 61, 63 at the top and side of the page
have little effect on the bit profile. However, the bit profile 
can be altered significantly by gray regions dithered as 
halftones (e.g., region 65). Halftones at the top or bottom of 
the page appear as isolated peaks in the bit profile, and they 
can be detected and removed because their local averages
are too high to be text lines. Halftones along the length of the 
page add random noise to the bit profile. However, these 
random noises are usually quite uniform in density and do 
not significantly affect the PSD. Extraneous text 65 on the 
side of the page can have a more dramatic effect on the PSD,
however. The text energy for the side content 65 will either 
be absorbed into that of the body text, when their line 
spacings are the same, or produce a separate, usually smaller 
peak, when their line spacings differ.

One of the most serious defects is skew caused by rotation
of the image on the page. Rotation 67 has the effect of 
locally averaging horizontal projections in bit profiles, making
the peaks and valleys less prominent. The larger the
skew, the greater the effect of smoothing. Although smooth- 
ing the bit profile does not change the dominant frequency, 
it does change the energy distribution, pushing the energy 
under the peak frequency towards the lower end of the
spectrum, and making detection much harder. This is illus- 
trated by the PSD 69 of FIG. 4 in which power densities 72 
and 74 are plotted for unskewed image 71 and skewed image 
73, respectively. In one embodiment, a preprocessing step is 
applied to the profile before spectral estimation to remove 
low frequency energies. Overall, the global statistics are 
quite robust to image distortions. However, cross correlation 
of bit profiles is less tolerant to these distortions. 
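
The text does not specify how the low frequency energies are removed. One simple possibility, shown here purely as an assumed example, is to subtract a centered moving average from the bit profile before spectral estimation. The function name remove_low_frequencies and the window parameter are hypothetical.

```python
def remove_low_frequencies(profile, window):
    """Subtract a centered moving average (an assumed, simple high-pass step).

    window is the averaging width in scanlines; near the ends of the
    profile the average is taken over the samples that are available.
    """
    half = window // 2
    n = len(profile)
    filtered = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        local_mean = sum(profile[lo:hi]) / (hi - lo)
        filtered.append(profile[i] - local_mean)
    return filtered


# A slow ramp (low-frequency content) is suppressed in the interior.
ramp = list(range(10))
detrended = remove_low_frequencies(ramp, window=3)
```

A window somewhat longer than the expected line spacing would leave the text-line periodicity largely intact while suppressing slower trends; the right width is a tuning choice that this sketch does not settle.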

There are other relevant factors such as resolution and 
encoding formats that contribute to variations in the bit 
profiles. As a result of the two-dimensional encoding and
fixed Huffman coding tables used in Group 4 compression, 
the number of bits required for compression does not scale 
linearly with the length of runs. Although a change in 
horizontal resolution (e.g., caused by magnification) does 
not have a constant scaling effect across the bit profile, the 
residual errors tend to be negligible. Down sampling the bit 
profile, which adjusts for different vertical resolutions, also 
helps to reduce local variations. 
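
Down sampling a bit profile can be sketched in a few lines. The version below averages consecutive groups of scanlines, which is one of the combining rules the Terminology section allows; the function name downsample_profile and the integer factor are assumptions for illustration.

```python
def downsample_profile(profile, factor):
    """Average each run of `factor` consecutive scanline bit counts.

    A trailing group shorter than `factor` is averaged over its own length.
    """
    reduced = []
    for start in range(0, len(profile), factor):
        group = profile[start:start + factor]
        reduced.append(sum(group) / len(group))
    return reduced


# Example: a profile taken at twice the reference vertical resolution.
reduced = downsample_profile([10, 20, 30, 50, 7], 2)
```

Bringing two profiles to a common vertical resolution in this way lets them be compared sample by sample, and the averaging smooths scanline-level fluctuations.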

While G4 compressed images have been emphasized, the 
implications of the G3 compression scheme are worth 
mentioning. The differences between the bit profiles of a G3 
and a G4 encoded file of the same image reside in the 
one-dimensional (1D) and two-dimensional (2D) coded
scanlines. Since 2D coding is usually more efficient than 1D
coding, the G3 encoded bit profile is the G4 encoded bit
profile plus a periodic waveform with a frequency of k,
where k is the frequency of 1D coded lines. In Group 3, the
recommended setting for k is every 2 lines at 200 dots per
inch (dpi) and every 4 lines for higher resolutions. In 
practice, the differences tend to be small relative to the peak 
heights and these frequencies are usually too high to be 
confused with the actual line spacing. This type of periodic
noise can also occur, independently of the encoding scheme, 
in TIFF formatted files, where images are often encoded as 
fixed-size strips to facilitate manipulation. As a result, the first
row of each strip is effectively 1D encoded. Varying the
RowsPerStrip parameter setting will produce a correspond- 
ing change in the PSD. As with the periodic waveform of 
frequency k in G3, the noise caused by effective 1D coding
in TIFF formatted files is relatively small compared to the 
peaks produced by text lines and can be neglected. 
2. Detailed Matching 

Because visually different documents can have similar 
compression bit profiles, a second stage process may be 
necessary to resolve any uncertainty in the list of candidates 
produced by the coarse matching stage. According to one 
embodiment, more information is obtained by extracting a 
set of endpoint features from the G4 or otherwise com- 
pressed query image. After analysis, a subset of these 
endpoint features are identified as markers. Descriptors 
based on the positions of these markers are generated for 
document indexing. Cross validation is carried out if a set of 
document candidates are provided by the coarse matching 
procedure. The following sections describe endpoint feature 
extraction and descriptor generation in further detail. 
2.1 Endpoint Extraction 

To facilitate an understanding of endpoint feature 
extraction, it is helpful to briefly discuss the Group 4 
compression scheme. In the Group 4 compression format, 
each scan line is encoded with respect to the line above. 
Referring to FIGS. 5A and 5B, the starting points for two 
consecutive pixel runs, referred to as changing elements, on
both lines are identified at any time with respect to the
current encoding point, a0. Based on the relative positions
among these changing elements, one of three possible
modes, horizontal, vertical or pass mode, is selected for
encoding. After encoding, a0 is moved forward and the
process is repeated. This is indicated by arrows 81 and 83 in
FIGS. 5A and 5B, respectively. During decoding, the
decoded mode is used in combination with known changing
elements (a0, b1, b2) to determine positions of a1 and/or a2.
Element a0 is then moved forward as with encoding.
Therefore, the mode information is decoded first. Positions
of the changing elements are also maintained at all times.

Pass codes occur at locations that correspond to bottom of
strokes (white pass) or bottom of holes (black pass). For
Roman alphabets, these feature points occur at the end of a
downward vertical stroke or the bottom of a curved stroke,
as shown by the pass code diagram 87 in FIG. 6 (each bold
square dot 88 indicates a pass code location). The alignment
of pass codes near baselines and the structural information
they carry make them useful in a variety of tasks such as
skew estimation and text matching. Equally important is the
fact that they can be extracted easily from a Group 4
compressed file.

While pass codes are useful, they also have limitations.
First, pass codes are unstable in the sense that while all white
pass codes correspond to bottoms of strokes, not all bottoms
of strokes are represented by pass codes. Because of the
context-dependent nature of Group 4 encoding modes, identical
local patterns of changing elements can be encoded
differently.

For example, the black run 84 starting at b1 in FIG. 5A
will produce a pass code. However, the black run 85 starting
at b2 in FIG. 5B will not generate a pass code, as it would
if a1 were to shift one pixel to the right. Instead, the bottom
of a stroke 86 at b2 is completely shadowed by the horizontal
mode encoding which spans a0 to a2.

Another limitation of pass codes is that they are asymmetric.
While the bottom of a stroke or a hole may be
captured by a pass code, pass codes yield no information
about the top of the stroke or hole. As illustrated in pass code
diagram 87 of FIG. 6, for example, the bottom of a "d" often
contains two pass codes, one white 89A and one black 89B,
while no feature point on the top of the character is captured.

Because of the limitations of pass codes, in at least one
embodiment of the detailed matching stage, endpoint features
are extracted directly from the changing elements in a
compressed query image. Two types of endpoints are
extracted: up and down endpoints. Down endpoints are
bottoms of strokes, similar to what white pass codes capture.
However, an important difference between down endpoints
and pass codes is that down endpoints are extracted by
directly comparing the positions of changing elements a1
and b2, eliminating the possibility of obscurity by horizontal
encoding. Thus, in contrast to pass codes, all bottoms of
strokes are down endpoints and vice-versa. The tops of
strokes are similarly extracted as up endpoints using changing
elements a2 and b1. An endpoint diagram 94 in FIG. 6
illustrates the features captured by up and down endpoints,
and is positioned beneath pass code diagram 87 to illustrate
the differences between features captured by pass codes and
by endpoints. The endpoint diagram 94 also illustrates that
down endpoints 96 align primarily at the baseline of a text
line, while up endpoints 95 align primarily at the x-height
line 97 (an x-height line is a line determined by the top of
a lower case "x").

FIGS. 7A and 7B are provided along with the following
pseudocode to illustrate the manner in which up and down
endpoints are identified.

    if ((pixel(a0) == WHITE) and (a1 > b2))
    {
        down endpoint at (b1 + (b2 - b1 + 1)/2, r-1)
        move a0 to b2
    }
    if ((pixel(a0) == WHITE) and (b1 > a2) and (b0 < a1))
    {
        up endpoint at (a1 + (a2 - a1 + 1)/2, r)
        move a0 to a2
    }

Referring to FIG. 7A, a0 is white and a1 occurs after b2.
Thus, the first conditional statement in the above
pseudocode listing is satisfied and a down endpoint 96 is
therefore specified on the r-1 line at a point approximately
midway in the run from b1 to b2. Element a0 is advanced
to a location in line r beneath b2, as shown by arrow 101.

Referring to FIG. 7B, a0 is white, b1 occurs after a2, and
b0 occurs before a1. Consequently, the second conditional
statement in the above pseudocode listing is satisfied and an
up endpoint 95 is therefore specified on the r line at a point
approximately midway in the run from a1 to a2. Element a0
is advanced to a2, as shown by arrow 103.

Endpoints have several advantages over pass codes. First,
endpoints are more stable; the same feature points will not
be obscured by different encoding modes. Also, endpoints
provide information around both the x-height line and
baseline of a text line. This allows for information such as
text height, page orientation and ascenders to be extracted.
The symmetric nature of up and down endpoints is also
beneficial in dealing with inverted pages. If the page is
inverted, the endpoints for the correctly oriented page can be
obtained by switching the up and down endpoints followed
by a simple coordinate remapping. It is not necessary to
re-scan the compressed document. By contrast, because pass
codes are asymmetric, it is usually necessary to invert the
image and recompress to obtain the corresponding feature
points. In addition, endpoints are detected based on relative
positions of changing elements, so their positions are as
easy to calculate as pass codes.

2.2 Document Indexing

Following feature extraction, the two dimensional endpoint
information is converted to a one dimensional representation
for efficient indexing. Several operations are
involved in this conversion. First, page skew is estimated
and corrected based on the endpoints. The smoothed horizontal
projection profiles for the skew corrected up and
down endpoints, which will be referred to as U profile and
D profile, are used to locate text lines. Because x-height lines
must be above their corresponding baselines, the D profile
must lag behind the U profile. The maximum correlation
between the U profile and D profile is calculated within an
offset constrained by the dominant line spacing, which is
obtained from spectral analysis of the profiles. In the correlated
profile, wherever a local maximum in the U profile
matches up with a local maximum in the D profile, separated
by a distance equal to text height, there is a good possibility
that a text line is located. To improve on the correlation
between the U and D profile, all but the local maximum in
the U and D profiles are zeroed within a range just short of
twice the line spacing. This tends to filter out all but the
x-height lines from the U profile and the baselines from the
D profile. Correlation is then performed on the profile of
local maxima. FIG. 8 illustrates a set of endpoints 109 that
have been extracted from a query image and skew corrected;
a horizontal projection of up and down endpoints 112 (the
down endpoints are the negatively projecting values 114 that
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form the D profile); local maxima of U and D profile query. In one formulation, the weight for each descriptor is 

projection 115; and matching local maxima of U and D inversely proportional to the number of documents it 

profiles 117. indexes. For example, suppose N is 5 in the example of FIG. 

Given a set of text line locations, the endpoints within 10, then K-(number of distance indicators-N)+l-(19-5)+ 

each text line zone are extracted. Because the endpoints 5 1=15 sequences Si-S 15 will be generated as follows: 
within a given text line can be used to locate the x-height line 

and baseline, regions within the text line called ascender and 5j = (1 11 13 4 2) 
descender zones can be defined. FIG. 9 shows an image 

region 125 containing two text line segments 127, 129 and ft ■ (H. 13, 4, % 2) 

the up and down endpoints contained within each segment. 10 s ^ _ (13> 4 2> 2, 2) 
The corresponding endpoint map 131 includes text line 
boundaries 133A-133C shown in solid lines, and ascender 

zones 135A, 135B and descender zones 137A, 137B delim- StJ = ^ 8 , s, 0, -1 1) 
ited by dashed lines. Up and down endpoints are represented 

in the endpoint map 131 by upward pointing triangles 141 15 

and downward pointing triangles 143, respectively. By weighting each sequence in inverse proportion to the 
Several observations can be made from the illustration in number of documents it indexes, each of the K sequences, 
FIG. 9. First, significant information can be deduced from S t -, contributes a score of 1/(K*H) to every one of the M £ 
the relative positions of up and down endpoints. For documents that S ( indexes. Documents that receive scores 
examples, diacrits such as dots and "i"s and "j"s are well ~ n 8 reater ^ m a threshold are returned by the detailed match- 
indicated by the presence of both up and down endpoints at ™& sta S c - Large N values will produce fewer, more unique 
the same x location in the ascender zones 135A, 135B (e.g., descriptors, but longer sequences are also more susceptible 
as shown by arrow 145). Moreover, up endpoints in the io * no ™' Alternative formulations for weight- 
•jji \a*k 11 . j • nig descriptors and for selecting descriptors may be used 
middle zone 147A, 147B usual y represent upwa^ curves in * ^ from ^ ^ of m / t mven Uon. 
characters such as "e , "s" and "t '. Character "c is reflected ^ 3 Experimental Results 

by two opposmg down and up endpomts in the middle zone Experiments have been conducted on a database of 979 

(e.g., as shown by arrow 146). document images. Of the 979 images, 292 images (146 

According to one embodiment, sequences of endpoints pairs ) have a ma tching counterpart. Each of the 292 images 

extracted from a relatively small text region are used to is u^d as a query for retrieving its counterpart from the 

provide an index for document matching. With well-defined 30 remaining 978 i mag es. The coarse and detailed matching 

reference lines, there are several possibilities to encode procedures were tested independently as well as in combi- 

endpoints as sequences. From visual inspection, it can be Qation Results on each experiment will be presented, 

observed that endpoints occurring inside the x-height zone 3^ Coarse Matching 

are more susceptible to noise due to touching, In one implementation of the coarse matching algorithm, 

fragmentation, serifs and font style variations. Therefore, in 35 ^ Q original bit profile obtained at the vertical image reso- 

one embodiment, endpoints in the middle zones are ignored ktion ^ down sarrjp i ed by averaging to 36 dpi. This implies 

and only up endpoints above the x-height fine (i.e., in the ^ smallest detectable fine spacing is 4 points. At this 

ascender zone) and down endpoints below the baseline (i.e., resolution, the bit profile is smooth enough and yet provides 

in the descender zone) are used as markers. Endpoints from sufficient details for index calculation and profile correla- 

other regions of a text line may be used as markers in ^ tion Dur in g spectral analysis, the dominant line spacing is 

alternate embodiments. searched only in the frequency between 8 and 36 points. The 

In one embodiment, sequences of quantized distances profile vaJues are norm alized to bits per inch at 300 dpi 

between consecutive markers are used as descriptors. The (horizontal), and quantized to 8 bits. The sample depth at 8 

quantization of distances between consecutive markers is 5its is found exp erimentally based on the observation that, 

illustrated in FIG. 10. Positive values are used to indicate 45 ^ various font slyles> 8 point texts requir e i ess than 

distances between up endpoint markers (i.e., "ascenders") 2Q0 bits per ^ for com p ression at 30 Q dpi. Scanlines 

and negative values are used to indicate distances between exceeding an average of 255 bits per inch usually contain 

down endpoint markers (i.e., "descenders"). The left-most halftones. Profiles obtained at other image resolutions are 

endpoint in each text line region is used as a reference point ( after being vertically resampled at 36 dpi) first scaled 

149A, 149B. Other reference points and distance formats 50 proportionally then quantized. No special adjustments are 

may be used in alternate embodiments. To maintain the made for Group 3/Group 4 encoded files or the strip size in 

two-dimensional structure, distance indicators across text jjpp formal 396 bytes of data ^ pro duced for a 

fines are concatenated, separated by a 0. Hence, a string of typical g 5xll ^ page (n inchx36 dpix8 bils ). 

positive and negative values will be generated for given recall rates for me top N choices are summarized in 

fines of text. For example, for the lines of text shown in FIG. 55 ^ taWe of FIG n corre lation of the bit profiles 

10, the string of distance indicators will be: produced 86% correct on top choice, and 91% correct on top 

1, 11, 13, 4, 2, 2, 2, 4, 0, -39 3 choices. Using the global statistics for indexing, the 

0, average number of candidates for cross correlation calcula- 

5, 7, 7, 4, 8, 6, 0, -11 tion is reduced by 90% without any loss in the recall rate. 

Other formats for marker distances may be used in 60 The Discrete Fourier Transform of the bit profiles for images 

alternate embodiments. For example, marker distances in the in the database are precomputed and stored, so cross corre- 

ascender and descender zones can be interleaved in strictly lation can be calculated by a vector product. Therefore, each 

left to right order. image query involves extracting the bit profile, filtering by 

In one embodiment, each document in the database is global statistics, followed by approximately 100 vector 

reverse indexed by descriptors formed by respective 65 products of dimension 396. 

sequences of N consecutive distances. Similarly, K Examples of correctly and incorrectly matched docu- 

sequences of N consecutive distances are formed during a ments are found in FIG. 12. The correctly matched cases 
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demonstrate the robustness of the features in coping with selection scheme. Generating descriptors for each located 

deformations discussed above. Most errors resulted from text line will increase the database size and reduce recalled 

skewed images or images containing halftones. Although the precision. One possibility for identifying candidate regions 

quality of halftones does not affect indexing, the quality may is to base the selection on local feature point densities. Other 

significantly affect profile correlation. In addition, problems 5 techniques may also be used without departing from the 

may be caused by multiple-column pages, especially scope of the present invention, 

multiple-column pages that contain halftones in one column 3.3 Combined Solution 

and text in the other. Non-collinear columns can lead to In the combined test, the result of the coarse matching 

aliasing and incorrect line spacing estimation. Two pairs of stage is returned if the correlation score of the top choice is 

images have scale differences. 10 greater than 0.85 and the difference between the top and 

It has been observed that line spacing and text energy second choice score is more the 0.03. Otherwise, the top 

indices are much more effective in constraining the search twenty choices are passed on for detailed matching. As a 

space than text location and text extent indices. This is result, 70% of the images are accepted after coarse 

expected because the Fourier transform is poor in spatial matching, and only 30% of the images require detailed 

localization. However, good frequency isolation is important 15 matching. The overall correct rate for the system is 93.8%. 

for discriminating the densely distributed line spacing Thus, results indicate that coarse matching by profile cor- 

between pointsize 9 and 12. To improve on text location, a relation not only improves execution efficiency, but also 

wavelet transform may be used. eliminates candidates which otherwise would be confused 

3.2 Detailed Matching by detailed matching alone. Different results will be 

In a detailed matching experiment, endpoints were 20 achieved by modifying the decision rule. In most cases, the 

extracted from a 1.5 by 1 inch region from the first body of detailed matching stage should be invoked to improve the 

text in the image using the ground truth information. The reliability of detection, 

text line location algorithm was then applied to detect 4. Overview of Processing System 

endpoints in the ascender and descender zones. Although FIG. 14 is a block diagram of a processing system 150 that 

some of those regions contained non-text portions of the 25 can be used to perform processing operations used in 

image, the line location technique was relied upon to elimi- embodiments of the present invention. The processing sys- 

nate any feature points not belonging to text lines. After the tem 150 includes a processing unit 151, memory 153, 

ascender and descender zones were defined, a sequence of display device 155, cursor control device 157, keypad 158, 

distances between endpoint markers was generated for each and communications device 159 each coupled to a bus 

patch. Taking every N consecutive distances as an index, 30 structure 161. The processing system 150 may be a desktop 

multiple descriptors were constructed for a database query. or laptop computer or a workstation or larger computer. 

Using the weighting scheme described above, the image Alternatively, the processing system 150 may be a copy 

receiving the highest score was selected. system, facsimile system, or other electronic system in 

Each of the 292 images was used to query into the full set which it is desirable to process compressed document 

of 979 images. The images themselves are recalled 100% of 35 images. The cursor control device 157 may be a mouse, 

the time. In 290 of the 292 cases, only the image itself is trackball, stylus, or any other device for manipulating ele- 

retrieved as the top choice. In two cases, one additional ments displayed on display device 155. The keypad 158 may 

image was recalled with a tied score. For duplicate be a keyboard or other device to allow a user to input 

detection, a case is considered correct if the counterpart alphanumeric data into the processing system 150. Other I/O 

scores highest among the rest of the images without any ties. 40 devices 163 may be present according to the specific func- 

The results for different values of N are summarized in the tions performed by the processing system 150. 

table shown in FIG. 13. The processing unit 151 may include one or more general 

Using sequences of three, four and five distances, 92.5% purpose processors, one or more digital signal processors or 

of the duplicate are correctly detected. This performance is any other devices capable of executing a sequence of 

comparable to results achieved using more computationally 45 instructions. The processing unit 151 may also be distributed 

intensive techniques. In addition, the indexing approach has among multiple computers of the processing system 150. 

much greater scalability than the distance based strategy. When programmed with native or virtual machine 

Most of the mistakes are due to noise in the feature points. instructions, the processing unit may be used to carry out the 

Because the projection profile based text line location above-described coarse matching stage and detailed match- 
technique relies on collinearity of text fines across the width 50 ing stage operations. 

of the page, the performance of the technique is affected by The communications device 159 may be a modem, net- 
misaligned multiple-column documents. One solution to this work card or any other device for coupling the processing 
problem is to use vertical projection profile for column system 150 to a network of electronic devices (e.g., a 
segmentation. Another solution is to perform text line loca- computer network such as the Internet). The communica- 
tion within vertical slices of the document, and use only the 55 tions device may be used to generate or receive a signal that 
high confidence results to avoid column boundaries. is propagated via a conductive or wireless medium. The 
Spurious feature points occurring beyond text line bound- propagated signal may be used, for example, for contacting 
aries can generate false descriptors. Some measures for sites on the World Wide Web (or any other network of 
detecting the horizontal extent of text lines may be provided. computers) and for receiving document images, updated 
Because the feature points have been skew corrected and the 60 program code or function-extending program code that can 
positions of the x- height lines and baselines are known, the be executed by the processing unit to implement embodi- 
ends to fine segments may be found based on the endpoint ments of the present invention. 

profiles discussed above. Furthermore, the regions for In one embodiment, the memory 153 includes system 

descriptor generation can be automatically determined. In memory 166, non-volatile mass storage 167 and removable 

the experiment, ground truth information was used to iden- 65 storage media 168. The removable storage media may be, 

tify corresponding text regions in document images. This for example, a compact disk read only memory (CDROM), 

registration process may be replaced by an automatic region floppy disk or other removable storage device. Program 
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code, including sequences of instructions for performing the 
above-described coarse matching stage and detailed match- 
ing stage operations, may be stored on a removable storage 
media that can be read by the processing system 150 and 
used to operate the processing system in accordance with 5 
embodiments described herein. The non-volatile mass stor- 
age 167 may be a device for storing information on any 
number of non-volatile storage media, including magnetic 
tape, magnetic disk, optical disk, electrically erasable pro- 
grammable read only memory (EEPROM), or any other 10 
computer-readable media. Program code and data and pro- 
gram code for controlling the operation of the processing 
system in accordance with embodiments described herein 
may be transferred from the removable storage media 168 to 
the non-volatile mass storage 167 under control of an 15 
installation program. A database of document images may 
also be maintained in the non-volatile mass storage 167. 

In one embodiment, when power is applied to the processing
system 150, operating system program code is
loaded from non-volatile mass storage 167 into system
memory 166 by the processing unit 151 or another device,
such as a direct memory access controller (not shown).
Sequences of instructions comprised by the operating system
are then executed by processing unit 151 to load other
sequences of instructions, including the above-described
program code for implementing the coarse and detailed
matching stages, from non-volatile mass storage 167 into
system memory 166. Thus, embodiments of the present
invention may be implemented by obtaining sequences of
instructions from a computer-readable medium, including
the above-described propagated signal, and executing the
sequences of instructions in the processing unit 151.

Having described a processing system for implementing
embodiments of the present invention, it should be noted
that the individual processing operations described above
may also be performed by specific hardware components
that contain hard-wired logic to carry out the recited operations
or by any combination of programmed processing
components and hard-wired logic. Nothing disclosed herein
should be construed as limiting the present invention to a
single embodiment wherein the recited operations are performed
by a specific combination of hardware components.

In the foregoing specification, the invention has been
described with reference to specific exemplary embodiments
thereof. It will, however, be evident that various modifications
and changes may be made to the specific exemplary
embodiments without departing from the broader spirit and
scope of the invention as set forth in the appended claims.
Accordingly, the specification and drawings are to be
regarded in an illustrative rather than a restrictive sense.
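
For illustration only, the reverse indexing and weighted scoring described for the detailed matching stage might be sketched as follows. The sketch assumes that descriptors are plain tuples of N consecutive distance indicators and that a query descriptor S_i indexing M_i documents adds 1/(K*M_i) to each of them, as stated in the specification. The function names, the dictionary-based index and the second document "doc_b" are hypothetical and do not come from the specification.

```python
from collections import defaultdict


def make_descriptors(distances, n):
    """Return the K = len(distances) - n + 1 windows of n consecutive distances."""
    return [tuple(distances[i:i + n]) for i in range(len(distances) - n + 1)]


def build_index(documents, n):
    """Reverse index (assumed layout): descriptor -> set of document ids."""
    index = defaultdict(set)
    for doc_id, distances in documents.items():
        for desc in make_descriptors(distances, n):
            index[desc].add(doc_id)
    return index


def score_query(index, distances, n):
    """Each query descriptor S_i adds 1/(K*M_i) to the M_i documents it indexes."""
    descriptors = make_descriptors(distances, n)
    k = len(descriptors)
    scores = defaultdict(float)
    for desc in descriptors:
        docs = index.get(desc, ())
        for doc_id in docs:
            scores[doc_id] += 1.0 / (k * len(docs))
    return dict(scores)


# The 19 distance indicators given for FIG. 10.
FIG10 = [1, 11, 13, 4, 2, 2, 2, 4, 0, -39, 0, 5, 7, 7, 4, 8, 6, 0, -11]
# "doc_b" is made-up data that happens to share two descriptors with FIG10.
documents = {"doc_a": FIG10,
             "doc_b": [3, 9, 9, 1, 0, -20, 0, 5, 7, 7, 4, 8, 2]}

index = build_index(documents, n=5)
scores = score_query(index, FIG10, n=5)
```

With N = 5 the FIG. 10 string yields 15 descriptors, the first being (1, 11, 13, 4, 2) and the last (4, 8, 6, 0, -11). A threshold on the resulting scores would then select the documents returned by the detailed matching stage.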

What is claimed is: 

1. A method of determining if a query document matches 
one or more documents in a database, the method compris- 
ing: 

identifying up endpoints and down endpoints in the query
document, the up endpoints representing tops of features
in the query document and the down endpoints
representing bottoms of features in the query document;

generating a set of descriptors for the query document
based on locations of the up endpoints and the down
endpoints;

comparing the set of descriptors for the query document
against respective sets of descriptors associated with
the one or more documents in the database to determine
if the query document matches at least one of the one
or more documents;

wherein generating a set of descriptors for the query
document based on locations of the up endpoints and
the down endpoints comprises

identifying text lines in the query document based on
concentrations of up endpoints and down endpoints
along scanlines of the query document; and

generating the set of descriptors based on distances
between selected up endpoints and selected down
endpoints within the text lines in the query document;
and

wherein identifying text lines in the document based on
concentrations of up endpoints and down endpoints
along scanlines of the document comprises:

determining the number of up endpoints and the number
of down endpoints that lie on each of the scanlines;
and

identifying respective pairs of scanlines that have a 
local maximum number of up endpoints and a local 
maximum number of down endpoints as text lines. 
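
As a non-limiting illustration of the scanline pairing recited above, the following sketch counts endpoints per scanline and pairs local maxima. It assumes that scanlines are indexed from top to bottom, that a peak in the up-endpoint counts marks an x-height line, and that the next peak in the down-endpoint counts below it marks the matching baseline. The pairing rule, the tie handling and the function names are assumptions made for this example only.

```python
def local_maxima(counts):
    """Indices whose count is positive, >= the row above and > the row below."""
    peaks = []
    n = len(counts)
    for i, c in enumerate(counts):
        above = counts[i - 1] if i > 0 else 0
        below = counts[i + 1] if i + 1 < n else 0
        if c > 0 and c >= above and c > below:
            peaks.append(i)
    return peaks


def find_text_lines(up_counts, down_counts):
    """Pair each up-endpoint peak with the first down-endpoint peak below it.

    A pair is kept only if that down peak lies above the next up peak, so a
    baseline is never shared between two text lines (an assumed rule).
    """
    up_peaks = local_maxima(up_counts)
    down_peaks = local_maxima(down_counts)
    lines = []
    j = 0
    for k, u in enumerate(up_peaks):
        while j < len(down_peaks) and down_peaks[j] <= u:
            j += 1
        if j == len(down_peaks):
            break
        next_u = up_peaks[k + 1] if k + 1 < len(up_peaks) else None
        d = down_peaks[j]
        if next_u is None or d < next_u:
            lines.append((u, d))
            j += 1
    return lines


# Hypothetical per-scanline endpoint counts for two text lines.
up = [0, 1, 5, 1, 0, 0, 0, 0, 1, 6, 2, 0, 0, 0]
down = [0, 0, 0, 0, 1, 4, 1, 0, 0, 0, 0, 1, 5, 0]
text_lines = find_text_lines(up, down)
```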

2. A method of determining if a query document matches 
one or more documents in a database, the method compris- 
ing: 

generating a bit profile of the query document based on 
the number of bits required to encode each of a plurality 
of rows of pixels in the query document; 

comparing the bit profile of the query document against 
bit profiles associated with a first plurality of docu- 
ments from the database to identify one or more 
candidate documents; 

identifying endpoint features in the query document; 

generating a set of descriptors for the query document 
based on locations of the endpoint features; 

comparing the set of descriptors for the query document
against respective sets of descriptors for the one or
more candidate documents to determine if the query
document matches at least one of the one or more
candidate documents;

performing spectral analysis on the bit profile of the query
document to determine global statistics of the query
document; and

comparing the global statistics of the query document 
against global statistics associated with a second plu- 
rality of documents from the database to identify the 
first plurality of documents, the first plurality of docu- 
ments being a subset of the second plurality of docu- 
ments. 

3. The method of claim 2 wherein performing spectral 
analysis on the bit profile to determine global statistics 
comprises generating an estimation of at least one of a 
dominant line spacing in the query document, a proportion 
of the query document that is text, a location of text in the 
query document, and a text concentration. 

4. A method of generating a set of descriptors for iden- 
tifying a document, the method comprising: 

identifying up endpoints and down endpoints in the 
document, the up endpoints representing tops of fea- 
tures in the document and the down endpoints repre- 
senting bottoms of features in the document; 

identifying text lines in the document based on concen- 
trations of up endpoints and down endpoints along 
scanlines of the document; and 

generating a set of descriptors based on distances between 
selected up endpoints and selected down endpoints in 
the concentrations of up endpoints and down endpoints;

wherein identifying text lines in the document based on
concentrations of up endpoints and down endpoints 
along scanlines of the document comprises: 
determining the number of up endpoints and the num- 
ber of down endpoints that lie on each of the scan- 
lines; and 

identifying respective pairs of scanlines that have a 
local maximum number of up endpoints and a local 
maximum number of down endpoints as text lines. 

5. A method of generating a set of descriptors for iden- 
tifying a document, the method comprising: 

identifying up endpoints and down endpoints in the 
document, the up endpoints representing tops of fea- 
tures in the document and the down endpoints repre- 
senting bottoms of features in the document; 

identifying text lines in the document based on concen- 
trations of up endpoints and down endpoints along 
scanlines of the document; and 

generating a set of descriptors based on distances between 
selected up endpoints and selected down endpoints in 
the concentrations of up endpoints and down end- 
points; 

wherein identifying text lines in the document based on 
concentrations of up endpoints and down endpoints 
along scanlines of the document comprises: 
determining a dominant line spacing in the document; 
determining the number of up endpoints and the num- 
ber of down endpoints that lie on each of the scan- 
lines; and 

identifying as text lines respective scanline pairs in 
which the constituent scanlines are separated by a 
distance less than the dominant line spacing and in 
which the constituent scanlines respectively have a 
local maximum number of up endpoints and a local 
maximum number of down endpoints as text lines. 

6. The method of claim 5 wherein the dominant line 
spacing is determined based on spectral analysis of locations 
of the endpoints in the document. 

7. A method of generating a set of descriptors for iden- 
tifying a document, the method comprising: 

identifying up endpoints and down endpoints in the 
document, the up endpoints representing tops of fea- 
tures in the document and the down endpoints repre- 
senting bottoms of features in the document; 

identifying text lines in the document based on concen- 
trations of up endpoints and down endpoints along 
scanlines of the document; 

generating a set of descriptors based on distances between
selected up endpoints and selected down endpoints in
the concentrations of up endpoints and down endpoints;
and

generating a respective endpoint profile for each of the
scanlines, the endpoint profile including a count of up
endpoints identified on the scanline and a count of
down endpoints identified on the scanline, and wherein
identifying text lines based on concentrations of up
endpoints and down endpoints along scanlines of the
document comprises reducing all but local maximums
of the counts of up endpoints and the counts of down
endpoints in respective endpoint profiles.

8. A method of generating a set of descriptors for iden- 
tifying a document, the method comprising: 

identifying up endpoints and down endpoints in the
document, the up endpoints representing tops of features
in the document and the down endpoints representing
bottoms of features in the document;

identifying text lines in the document based on concentrations
of up endpoints and down endpoints along
scanlines of the document; and

generating a set of descriptors based on distances between 
selected up endpoints and selected down endpoints in 
the concentrations of up endpoints and down end- 
points; 

wherein identifying text lines based on concentrations of 
up endpoints and down endpoints along scanlines of the 
document comprises: 

generating a count of up endpoints and a count of down
endpoints for each of the scanlines;

identifying a first scanline within a locality of scanlines
that has the highest count of up endpoints;

reducing the count of up endpoints associated with each
scanline within the locality of scanlines except the
first scanline;

identifying a second scanline within the locality of 
scanlines that has the highest count of down end- 
points; and 

reducing the count of down endpoints associated with 
each scanline within the locality of scanlines except 
the second scanline. 

9. The method of claim 8 wherein identifying the first 
scanline within the locality of scanlines that has the highest 
count of up endpoints comprises: 

determining a dominant line spacing of the document; and

defining the locality of scanlines to be scanlines within a
range greater than the dominant line spacing but less
than twice the dominant line spacing.
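
By way of example only, the reduction of all but the highest count within a locality might be sketched as below. The sketch assumes that "reducing" means setting a count to zero, that the locality is a window centered on each scanline, and that its total width is one and a half times the dominant line spacing, which lies between one and two line spacings for spacings of at least two scanlines. These choices and the function names are assumptions of the example, not requirements of the claims.

```python
def locality_width(dominant_spacing):
    """Width in scanlines, chosen as 1.5x the spacing (assumes spacing >= 2)."""
    return (3 * dominant_spacing) // 2


def suppress_non_maxima(counts, width):
    """Zero every count that is not the first maximum of its centered window."""
    half = width // 2
    out = [0] * len(counts)
    for i, c in enumerate(counts):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        window = counts[lo:hi]
        # Keep the count only if it is the earliest occurrence of the maximum.
        if c > 0 and c == max(window) and lo + window.index(c) == i:
            out[i] = c
    return out


# Hypothetical up-endpoint counts with a dominant line spacing of 4 scanlines.
up_counts = [0, 2, 7, 3, 0, 1, 6, 2, 0, 0, 5, 1]
width = locality_width(4)
peaks_only = suppress_non_maxima(up_counts, width)
```

The same routine would be applied separately to the counts of down endpoints to find the second scanline of each locality.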

10. A method of generating a set of descriptors for 
identifying a document, the method comprising: 

identifying up endpoints and down endpoints in the 
document, the up endpoints representing tops of fea- 
tures in the document and the down endpoints repre- 
senting bottoms of features in the document; 

identifying text lines in the document based on concen- 
trations of up endpoints and down endpoints along 
scanlines of the document; and 

generating a set of descriptors based on distances between
selected up endpoints and selected down endpoints in
the concentrations of up endpoints and down endpoints;

wherein generating a set of descriptors based on distances 
between selected up endpoints and selected down end- 
points comprises defining an ascender zone and a 
descender zone for each of the text lines, the selected 
up endpoints being up endpoints in the ascender zone 
and the selected down endpoints being down endpoints 
in the descender zone. 

11. The method of claim 10 wherein defining an ascender 
zone and a descender zone for each of the text lines 
comprises: 

defining a region above an x-height line of a first text line
of the text lines to be the ascender zone for the first text
line; and

defining a region below the baseline of the first text line 
to be the descender zone for the first text line. 

12. The method of claim 11 wherein the ascender zone of 
the first text line is bounded in part by the descender zone for 
the preceding text line. 

13. A method of generating information that can be used 
to identify a document, the method comprising: 

generating a bit profile based on the number of bits 
required to encode each of a plurality of rows of pixels 
in the document; and

performing spectral analysis on the bit profile to determine
global statistics of the document;

wherein performing spectral analysis on the bit profile to
determine global statistics comprises generating an
estimation of a dominant line spacing in the document;
and

wherein generating an estimation of a dominant line 
spacing comprises generating a power spectrum den- 
sity from the bit profile and calculating the estimation 
of the dominant line spacing from a peak value in the 
power spectrum density. 
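
As an illustrative sketch of estimating the dominant line spacing from a peak in the power spectrum density, the code below uses a direct discrete Fourier transform. It assumes a bit profile sampled at 36 dpi, a search range of 8 to 36 points as mentioned in the specification, and 72 points per inch. The mean removal, the normalization of the spectrum and the function names are assumptions of the example.

```python
import cmath
import math


def power_spectrum(profile):
    """Power at frequency bins 0..n/2 of the mean-removed profile (naive DFT)."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]
    psd = []
    for k in range(n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                  for t, v in enumerate(centered))
        psd.append(abs(acc) ** 2 / n)
    return psd


def dominant_line_spacing(profile, dpi=36, min_pt=8.0, max_pt=36.0):
    """Return the line spacing in points at the strongest in-range peak."""
    n = len(profile)
    psd = power_spectrum(profile)
    best_k, best_power = None, 0.0
    for k in range(1, len(psd)):
        period_pt = (n / k) * 72.0 / dpi  # period in samples -> points
        if min_pt <= period_pt <= max_pt and psd[k] > best_power:
            best_k, best_power = k, psd[k]
    if best_k is None:
        return None
    return (n / best_k) * 72.0 / dpi


# Synthetic profile: 2 inches at 36 dpi with a text line every 6 samples.
profile = [100 + 50 * math.cos(2 * math.pi * t / 6) for t in range(72)]
spacing = dominant_line_spacing(profile)
```

In this synthetic case a period of 6 samples at 36 dpi corresponds to a line spacing of 12 points.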

14. A method of generating information that can be used 
to identify a document, the method comprising: 

generating a bit profile based on the number of bits
required to encode each of a plurality of rows of pixels
in the document; and

performing spectral analysis on the bit profile to determine
global statistics of the document;

wherein performing spectral analysis on the bit profile to
determine global statistics comprises generating an
estimation of a proportion of the document that is text;
and

wherein generating an estimation of a proportion of the
document that is text comprises generating a power
spectrum density from the bit profile and calculating
the estimation of the proportion of the document based
on an energy under a peak value in the power spectrum
density.

15. A method of generating information that can be used
to identify a document, the method comprising:

generating a bit profile based on the number of bits 
required to encode each of a plurality of rows of pixels 
in the document; and 

performing spectral analysis on the bit profile to determine
global statistics of the document;

wherein performing spectral analysis on the bit profile to
determine global statistics comprises generating an
estimation of a location of text in the document; and

wherein generating an estimation of a location of text in
the document comprises:

applying a bandpass filter to the bit profile to generate
a text energy profile; and

determining a centroid of the text energy profile to be
the estimation of the location of text in the document.

16. The method of claim 15 wherein applying a bandpass 
filter to the bit profile comprises: 

determining a dominant line spacing frequency of the
document; and

selecting a center frequency of the bandpass filter based
on the dominant line spacing frequency.
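
Purely as an illustration of a bandpass filter centered on the dominant line spacing frequency, the sketch below zeroes all frequency bins outside a band, transforms back, and takes the centroid of the squared result as the text location. The ideal (brick-wall) filter, the half-width of three bins, the use of the squared filtered signal as the text energy profile and the function names are all assumptions of the example.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(x))
        out.append(acc / n if inverse else acc)
    return out


def text_energy_profile(profile, center_bin, half_width=3):
    """Keep only bins within half_width of center_bin, then square the result."""
    n = len(profile)
    spectrum = dft(profile)
    for k in range(n):
        freq = min(k, n - k)  # fold negative frequencies onto positive ones
        if abs(freq - center_bin) > half_width:
            spectrum[k] = 0
    filtered = dft(spectrum, inverse=True)
    return [v.real ** 2 for v in filtered]


def text_location(energy):
    """Centroid (in samples) of the text energy profile, or None if empty."""
    total = sum(energy)
    if total == 0:
        return None
    return sum(t * e for t, e in enumerate(energy)) / total


# Synthetic profile: text with a 6-sample line spacing on rows 48..83 only.
n = 96
profile = [50 * (1 + math.cos(2 * math.pi * t / 6)) if 48 <= t < 84 else 0.0
           for t in range(n)]
energy = text_energy_profile(profile, center_bin=n // 6)
location = text_location(energy)
```

For this synthetic profile the centroid falls near the middle of the text rows, that is, around sample 65.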

17. A method of generating information that can be used 
to identify a document, the method comprising: 

generating a bit profile based on the number of bits 
required to encode each of a plurality of rows of pixels 
in the document; and 

performing spectral analysis on the bit profile to deter- 
mine global statistics of the document; 

wherein performing spectral analysis on the bit profile to 
determine global statistics comprises generating an 
estimation of text concentration in the document, the 
estimation of text concentration indicating a lengthwise 
measure of a proportion of the document that is text; 
and 

wherein generating an estimation of text concentration in 
the document comprises: 

applying a bandpass filter to the bit profile to generate
a text energy profile; and

determining the estimation of the text concentration
based on a length of the text energy profile.